
    Two new Probability inequalities and Concentration Results

    Concentration results and probabilistic analysis for combinatorial problems such as the TSP, MWST, and graph coloring have received much attention, but generally only for i.i.d. samples (for the TSP, for example, i.i.d. points in the unit square). Here we prove two probability inequalities that generalize and strengthen martingale inequalities. The inequalities provide the tools to handle more general heavy-tailed and inhomogeneous distributions for combinatorial problems. We prove a wide range of applications: in addition to the TSP, MWST, and graph coloring, we also prove more general results than previously known for concentration in bin packing, subgraph counts, and the Johnson-Lindenstrauss random projection theorem. It is hoped that the strength of the inequalities will serve many more purposes.
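
    For context, a minimal statement of the classical Azuma-Hoeffding martingale inequality, the kind of bound the abstract says is generalized and strengthened; the notation below is the standard textbook one, not taken from the paper:

        \text{If } (S_i)_{i=0}^{n} \text{ is a martingale with } |S_i - S_{i-1}| \le c_i \text{ for all } i, \text{ then}
        \Pr\big[\,|S_n - S_0| \ge t\,\big] \;\le\; 2\exp\!\Big(-\frac{t^2}{2\sum_{i=1}^{n} c_i^2}\Big).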

    Clustering with Spectral Norm and the k-means Algorithm

    There has been much progress on efficient algorithms for clustering data points generated by a mixture of k probability distributions under the assumption that the means of the distributions are well-separated, i.e., the distance between the means of any two distributions is at least Ω(k) standard deviations. These results generally make heavy use of the generative model and particular properties of the distributions. In this paper, we show that a simple clustering algorithm works without assuming any generative (probabilistic) model. Our only assumption is what we call a "proximity condition": the projection of any data point onto the line joining its cluster center to any other cluster center is Ω(k) standard deviations closer to its own center than to the other center. Here the notion of standard deviations is based on the spectral norm of the matrix whose rows represent the difference between a point and the mean of the cluster to which it belongs. We show that in the generative models studied, our proximity condition is satisfied, and so we are able to derive most known results for generative models as corollaries of our main result. We also prove some new results for generative models, e.g., we can cluster all but a small fraction of points assuming only a bound on the variance. Our algorithm relies on the well-known k-means algorithm, and along the way, we prove a result of independent interest: that the k-means algorithm converges to the "true centers" even in the presence of spurious points, provided the initial (estimated) centers are close enough to the corresponding actual centers and all but a small fraction of the points satisfy the proximity condition. Finally, we present a new technique for boosting the ratio of inter-center separation to standard deviation.
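
    As a reference point, a minimal sketch of the classical Lloyd k-means iteration whose convergence the abstract discusses; this is the textbook algorithm, not the paper's full clustering procedure, and all names are illustrative:

        import numpy as np

        def lloyd_kmeans(X, centers, iters=50):
            """Textbook Lloyd iteration: assign points to nearest center, recompute means."""
            for _ in range(iters):
                # Distance from every point to every current center.
                d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
                labels = d.argmin(axis=1)
                # Move each center to the mean of its assigned points.
                for j in range(len(centers)):
                    pts = X[labels == j]
                    if len(pts) > 0:
                        centers[j] = pts.mean(axis=0)
            return centers, labels

    The paper's result says, roughly, that this iteration recovers the true centers when initialized close enough to them, even with some spurious points present.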

    Spectral Approaches to Nearest Neighbor Search

    We study spectral algorithms for the high-dimensional Nearest Neighbor Search problem (NNS). In particular, we consider a semi-random setting where a dataset P in ℝ^d is chosen arbitrarily from an unknown subspace of low dimension k ≪ d, and then perturbed by fully d-dimensional Gaussian noise. We design spectral NNS algorithms whose query time depends polynomially on d and log n (where n = |P|) for large ranges of k, d, and n. Our algorithms use a repeated computation of the top PCA vector/subspace, and are effective even when the random-noise magnitude is much larger than the interpoint distances in P. Our motivation is that in practice, a number of spectral NNS algorithms outperform the random-projection methods that seem otherwise theoretically optimal on worst-case datasets. In this paper we aim to provide theoretical justification for this disparity.
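
    A minimal sketch of the core primitive the abstract mentions, computing the top PCA vector of a point set and projecting onto it; this illustrates only the building block, not the paper's full NNS data structure:

        import numpy as np

        def top_pca_direction(P):
            """Top principal direction of point set P (rows are points)."""
            C = P - P.mean(axis=0)  # center the data
            _, _, Vt = np.linalg.svd(C, full_matrices=False)
            return Vt[0]            # right singular vector for the largest singular value

        def project_onto(P, v):
            """1-D coordinates of the points along direction v."""
            return P @ v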

    Random Separating Hyperplane Theorem and Learning Polytopes

    The Separating Hyperplane Theorem is a fundamental result in convex geometry with myriad applications. Our first result, the Random Separating Hyperplane Theorem (RSH), is a strengthening of this for polytopes. RSH asserts that if the distance between a point a and a polytope K with k vertices and unit diameter in ℝ^d is at least δ, where δ is a fixed constant in (0,1), then a randomly chosen hyperplane separates a and K with probability at least 1/poly(k) and margin at least Ω(δ/√d). An immediate consequence of our result is the first near-optimal bound on the error increase in the reduction from a separation oracle to an optimization oracle over a polytope. RSH has algorithmic applications in learning polytopes. We consider a fundamental problem, denoted the "Hausdorff problem", of learning a unit-diameter polytope K within Hausdorff distance δ, given an optimization oracle for K. Using RSH, we show that with polynomially many random queries to the optimization oracle, K can be approximated within error O(δ). To our knowledge this is the first provable algorithm for the Hausdorff problem. Building on this result, we show that if the vertices of K are well-separated, then an optimization oracle can be used to generate a list of points, each within Hausdorff distance O(δ) of K, with the property that the list contains a point close to each vertex of K. Further, we show how to prune this list to generate a (unique) approximation to each vertex of the polytope. We prove that in many latent variable settings, e.g., topic modeling and LDA, optimization oracles do exist, provided we project to a suitable SVD subspace. Thus, our work yields the first efficient algorithm for finding approximations to the vertices of the latent polytope under the well-separatedness assumption.
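
    A toy Monte-Carlo illustration of the RSH statement under the abstract's setup (a point a at distance at least δ from a polytope K given by its vertex list): sample random unit directions and report one that separates a from K with a prescribed gap. This is an illustrative experiment, not the paper's method:

        import numpy as np

        def random_separating_direction(a, vertices, margin, trials=10000, rng=None):
            """Try random unit normals u until one separates a from conv(vertices) with the given gap."""
            if rng is None:
                rng = np.random.default_rng()
            for _ in range(trials):
                u = rng.normal(size=len(a))
                u /= np.linalg.norm(u)
                # u yields a separating hyperplane with gap >= margin iff u.a - max_v u.v >= margin.
                if a @ u - (vertices @ u).max() >= margin:
                    return u
            return None

    RSH predicts that for margin on the order of δ/√d, a success should occur within poly(k) trials.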

    Principal Component Analysis and Higher Correlations for Distributed Data

    We consider algorithmic problems in the setting in which the input data has been partitioned arbitrarily across many servers. The goal is to compute a function of all the data, and the bottleneck is the communication used by the algorithm. We present algorithms for two illustrative problems on massive data sets: (1) computing a low-rank approximation of a matrix A = A^1 + A^2 + ... + A^s, with matrix A^t stored on server t, and (2) computing a function of a vector a_1 + a_2 + ... + a_s, where server t has the vector a_t; this includes the well-studied special case of computing frequency moments and separable functions, as well as higher-order correlations such as the number of subgraphs of a specified type occurring in a graph. For both problems we give algorithms with nearly optimal communication; in particular, the only dependence on n, the size of the data, is in the number of bits needed to represent indices and words (O(log n)).
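
    A minimal sketch of one reason low communication is possible here: linear sketches commute with the sum A = A^1 + ... + A^s, so each server can compress its share locally and a coordinator simply adds the compressed pieces. This is illustrative only; the paper's actual protocols and guarantees are more involved, and all names below are hypothetical:

        import numpy as np

        def local_sketch(S, A_t):
            """Server t compresses its share A^t with the shared random matrix S."""
            return S @ A_t

        def coordinator_combine(sketches):
            """Because sketching is linear, the sum of sketches equals the sketch of A = sum_t A^t."""
            return sum(sketches)

        # Example: s servers holding n x m shares, shared k x n random projection S.
        rng = np.random.default_rng(0)
        n, m, s, k = 1000, 50, 4, 20
        S = rng.normal(size=(k, n)) / np.sqrt(k)
        parts = [rng.normal(size=(n, m)) for _ in range(s)]
        combined = coordinator_combine(local_sketch(S, A_t) for A_t in parts)
        assert np.allclose(combined, S @ sum(parts))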

    Characterization of a distinct lethal arteriopathy syndrome in twenty-two infants associated with an identical, novel mutation in FBLN4 gene, confirms fibulin-4 as a critical determinant of human vascular elastogenesis

    Background: Vascular elasticity is crucial for maintaining hemodynamics. Molecular mechanisms involved in human elastogenesis are incompletely understood. We describe a syndrome of lethal arteriopathy associated with a novel, identical mutation in the fibulin-4 gene (FBLN4) in a unique cohort of infants from South India. Methods: Clinical characteristics, cardiovascular findings, outcomes, and molecular genetics of twenty-two infants from a distinct population subgroup, presenting with characteristic arterial dilatation and tortuosity during the period August 2004 to June 2011, were studied. Results: Patients (11 males, 11 females) presented at a median age of 1.5 months and belonged to unrelated families from an identical ethno-geographical background; eight had a history of consanguinity. Cardiovascular features included aneurysmal dilatation, elongation, tortuosity, and narrowing of the aorta, pulmonary artery, and their branches. The phenotype included a variable combination of cutis laxa (52%), long philtrum and thin vermillion (90%), micrognathia (43%), hypertelorism (57%), prominent eyes (43%), sagging cheeks (43%), long slender digits (48%), and visible arterial pulsations (38%). Genetic studies revealed an identical c.608A>C (p.Asp203Ala) mutation in exon 7 of the FBLN4 gene in all 22 patients, homozygous in 21, and compound heterozygous in one patient with a p.Arg227Cys mutation in the same conserved cbEGF sequence. Homozygosity was lethal (17/21 died; median age 4 months). Isthmic hypoplasia (n = 9) correlated with early death (≤ 4 months). Conclusions: A lethal genetic disorder characterized by severe deformation of elastic arteries was linked to novel mutations in the FBLN4 gene. While describing a hitherto unreported syndrome in this population subgroup, this study emphasizes the critical role of fibulin-4 in human elastogenesis.

    Algorithmic Geometry of Numbers

    This article is about Algorithmic Geometry of Numbers. The fundamental basis reduction algorithm of Lovász, which first appeared in Lenstra, Lenstra, Lovász [46], was used in Lenstra's algorithm for integer programming and has since been applied in myriad contexts, starting with the factorization of polynomials (A. K. Lenstra [45]). Classical geometry of numbers has a special feature in that it studies geometric properties of (convex) sets, such as volume and width, which come from the realm of continuous mathematics, in relation to lattices, which are discrete objects. This makes it ideal for applications to integer programming and other discrete optimization problems, which seem inherently harder than their "continuous" counterparts such as linear programming.
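
    For context, the standard guarantee of the Lenstra-Lenstra-Lovász (LLL) basis reduction algorithm referenced above, stated in the usual textbook notation; this is classical material, not a claim specific to the article:

        \text{Given a basis of a lattice } L \subset \mathbb{R}^n, \text{ LLL outputs in polynomial time a reduced basis } b_1,\dots,b_n
        \text{ with } \|b_1\| \le 2^{(n-1)/2}\,\lambda_1(L),
        \text{ where } \lambda_1(L) \text{ is the length of a shortest nonzero vector of } L.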